Availability: Adds logic to avoid bad replica during cache refresh #3127

j82w · 2022-04-01T11:59:06Z

Pull Request Template

Description

Current design:
If the SDK gets a 410 or another failure signaling that a replica has moved to a different machine, an address cache refresh is triggered. The cache keeps returning the stale information until the new address list arrives, because the 3 other replicas should still be valid and can complete requests. This still leaves a 25% chance of a new request going to the bad replica, where the connection can take multiple seconds to time out.
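To make the 25% figure concrete, here is a small sketch of the arithmetic. It assumes each request picks one of the 4 cached replicas uniformly and independently, which is a simplification; the function name is hypothetical and not part of the SDK.

```python
def chance_of_hitting_bad_replica(requests: int, replicas: int = 4, bad: int = 1) -> float:
    """Probability that at least one of `requests` lands on a bad replica.

    Assumes uniform, independent replica selection (an illustrative
    simplification, not the SDK's exact selection logic).
    """
    healthy_fraction = (replicas - bad) / replicas
    return 1 - healthy_fraction ** requests


single = chance_of_hitting_bad_replica(1)   # 0.25, the figure quoted above
burst = chance_of_hitting_bad_replica(10)   # roughly 0.94 for ten requests
```

Under these assumptions, even a short burst of requests during the refresh window is very likely to include at least one slow request.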

The solution:
The individual addresses in GatewayAddressCache have an unhealthy flag. When a cache refresh is requested, the bad replica is marked as unhealthy. When the SDK picks a random replica, it always moves the unhealthy replicas to the end of the list. When the results from the gateway return, the health state is reset, so the replica is only avoided during the call to get the new addresses from the gateway.
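The ordering step can be sketched as follows. This is a minimal illustration in Python, not the SDK's C# code: `ReplicaAddress`, `order_replicas`, and the `rntbd://replica…` URIs are hypothetical names chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class ReplicaAddress:
    # Hypothetical stand-in for a cached replica address entry.
    uri: str
    unhealthy: bool = False


def order_replicas(addresses, rng=random):
    """Return a randomly ordered copy with unhealthy replicas moved to the end."""
    shuffled = list(addresses)
    rng.shuffle(shuffled)
    # sorted() is stable, so the random order is kept within each group;
    # False (healthy) sorts before True (unhealthy).
    return sorted(shuffled, key=lambda address: address.unhealthy)


replicas = [ReplicaAddress(f"rntbd://replica{i}") for i in range(1, 5)]
replicas[1].unhealthy = True  # replica2 returned 410 and a refresh was requested
ordered = order_replicas(replicas)
```

The unhealthy replica stays in the list as a last resort, so requests can still fall back to it if every healthy replica fails.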

The "unhealthy" state would be reset when a Gateway refresh response comes back (whether addresses changed or not) or after 1 minute, whichever comes first. So the throughput SLA regression risk (temporarily only using 3 out of 4 replicas) is only applicable for at most 1 minute.
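A minimal sketch of this reset rule, assuming the flag is tracked as a timestamp. The `ReplicaHealth` class and its method names are hypothetical, and Python is used only for illustration; the SDK's actual implementation may differ.

```python
import time

UNHEALTHY_TTL_SECONDS = 60.0  # the 1 minute upper bound from the description


class ReplicaHealth:
    """Hypothetical health tracker for a single replica address."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so tests can control time
        self._marked_at = None     # None means the replica is healthy

    def mark_unhealthy(self):
        # Called when a cache refresh is requested for this replica.
        self._marked_at = self._clock()

    def reset(self):
        # Called when the gateway refresh response comes back,
        # whether or not the addresses changed.
        self._marked_at = None

    def is_unhealthy(self):
        if self._marked_at is None:
            return False
        if self._clock() - self._marked_at >= UNHEALTHY_TTL_SECONDS:
            self._marked_at = None  # the flag expired before the gateway answered
            return False
        return True
```

Using a monotonic clock here avoids the flag expiring early or late if the wall clock is adjusted.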

// Cache refresh design

sequenceDiagram
    participant Request1
    participant Request2
    Request1->>+GatewayAddressCache: Get partition key range 0
    GatewayAddressCache->>-Request1: Returns replicas[1,2,3,4]
    Request1->>+Replica2: Get item
    Replica2->>-Request1: Gone (410) indicates the replica moved
    Request1->>GatewayAddressCache: Start background refresh of addresses
    Request1->>+Replica1: Get item
    Replica1->>-Request1: Returns item
    GatewayAddressCache->>+Cosmos Gateway: Get addresses for range 0 with ForceRefresh
    Request2->>+GatewayAddressCache: Get partition key range 0 
    GatewayAddressCache->>-Request2: Stale [1,2,3,4]
    Request2->>+Replica3: Replica2 has a 0% chance of being picked since the refresh is still occurring. It used to be 25%.
    Replica3->>-Request2: Returns item
    Cosmos Gateway->>-GatewayAddressCache: Return addresses [1,5,3,4]

Type of change

Please delete options that are not relevant.

[] Bug fix (non-breaking change which fixes an issue)
[] New feature (non-breaking change which adds functionality)
[] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs

FabianMeiswinkel

LGTM - Thanks

ealsur

LGTM, just a nit on the diagram. Request2 has a 410 response also on Replica3, should it be a 200/201?

Jake Willey added 2 commits March 31, 2022 18:55

Avoid unhealthy replica during cache refresh

6a6b0b4

Merge remote-tracking branch 'origin/master' into users/jawilley/ha/a…

4ee2f14

…voidUnhealthyReplica

j82w requested review from khdang, sboshra, neildsh, kirankumarkolli, ealsur, FabianMeiswinkel and kirillg as code owners April 1, 2022 11:59

FabianMeiswinkel reviewed Apr 1, 2022

Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs

Improve test reliability

8572902

FabianMeiswinkel approved these changes Apr 1, 2022

ealsur approved these changes Apr 1, 2022

j82w enabled auto-merge (squash) April 1, 2022 13:56

j82w merged commit 0aa0456 into master Apr 1, 2022

j82w deleted the users/jawilley/ha/avoidUnhealthyReplica branch April 1, 2022 14:08
